Sample-Optimal Identity Testing with High Probability
Authors
Abstract
We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0 < ε, δ < 1, we wish to distinguish, with probability at least 1 − δ, whether the distributions are identical versus ε-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that δ = Ω(1), for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\varepsilon^2)$. Typical applications of distribution property testing require small values of the confidence parameter δ (which correspond to small “p-values” in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of δ via black-box amplification, which multiplies the required number of samples by Θ(log(1/δ)). We show that this upper bound is suboptimal for any δ = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is $\Theta\left(\frac{1}{\varepsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta)\right)\right)$ for any n, ε, and δ. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $d_{TV}(p, U_n) \ge \varepsilon$, we simply threshold $d_{TV}(\hat{p}, U_n)$, where $\hat{p}$ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.
∗ Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
† Supported by the NSF under Grant No. 1420692.
‡ Supported by the NSF Graduate Research Fellowship under Grant No. 1122374, and by the NSF under Grant No. 1065125.
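As a rough illustration of the uniformity tester described in the abstract, the following Python sketch computes the empirical total variation distance to the uniform distribution and compares it to a threshold. The function names, the convention that the domain is {0, ..., n-1}, and the `threshold` parameter are assumptions made for this example; the abstract does not state how the threshold is chosen, so it is left as an input here. The helper `sample_bound` only evaluates the expression inside the Θ(·) bound, with the hidden constant omitted.

```python
import math
from collections import Counter


def sample_bound(n, eps, delta):
    """Evaluate (sqrt(n*log(1/delta)) + log(1/delta)) / eps^2.

    This is the expression inside the Theta(.) bound from the abstract;
    the hidden constant factor is NOT included (assumption: constant = 1).
    """
    log_term = math.log(1.0 / delta)
    return (math.sqrt(n * log_term) + log_term) / eps ** 2


def empirical_tv_to_uniform(samples, n):
    """Total variation distance between the empirical distribution of
    `samples` and the uniform distribution on {0, ..., n-1}.

    Assumes every sample is an integer in range(n).
    """
    m = len(samples)
    counts = Counter(samples)
    # Elements that appear at least once contribute |count/m - 1/n|.
    seen = sum(abs(c / m - 1.0 / n) for c in counts.values())
    # Elements that never appear each contribute |0 - 1/n|.
    unseen = (n - len(counts)) / n
    return 0.5 * (seen + unseen)


def uniformity_test(samples, n, threshold):
    """Accept (return True) when the empirical TV distance is at most
    `threshold`. The threshold is a hypothetical caller-supplied value."""
    return empirical_tv_to_uniform(samples, n) <= threshold
```

One way to pick `threshold` in practice, again only as an assumption, is to simulate the statistic under the uniform distribution for the chosen n and sample size and place the cutoff above its typical value.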
Related resources
Wasserstein Identity Testing
Uniformity testing and the more general identity testing are well studied problems in distributional property testing. Most previous work focuses on testing under L1-distance. However, when the support is very large or even continuous, testing under L1-distance may require a huge (even infinite) number of samples. Motivated by such issues, we consider the identity testing in Wasserstein distanc...
Optimal Testing for Properties of Distributions
Given samples from an unknown distribution p, is it possible to distinguish whether p belongs to some class of distributions C versus p being far from every distribution in C? This fundamental question has received tremendous attention in statistics, focusing primarily on asymptotic analysis, and more recently in information theory and theoretical computer science, where the emphasis has been o...
Differentially Private Testing of Identity and Closeness of Discrete Distributions
We study the fundamental problems of identity testing (goodness of fit), and closeness testing (two sample test) of distributions over k elements, under differential privacy. While the problems have a long history in statistics, finite sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both the problem...
Testing Bayesian Networks
This work initiates a systematic investigation of testing high-dimensional structured distributions by focusing on testing Bayesian networks – the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other nondescen...
Fourier-Based Testing for Families of Distributions
We study the general problem of testing whether an unknown discrete distribution belongs to a given family of distributions. More specifically, given a class of distributions 𝒫 and sample access to an unknown distribution P, we want to distinguish (with high probability) between the case that P ∈ 𝒫 and the case that P is ε-far, in total variation distance, from every distribution in 𝒫. This is...
Journal:
- Electronic Colloquium on Computational Complexity (ECCC)
Volume 24
Publication date: 2017